CS2006 Python Practical 2¶
Tutor: Stephen Linton
2024-3-29
- Student A - 230015014
- Student B - 220024634
- Student C - 220010065
In this practical, we are given a dataset containing a 1% sample of people from the 2011 Census database for England and Wales. We are asked to analyse this dataset using programs written by us and components of the Python ecosystem. Our project thoroughly refines the data, describes it accurately, and is repeatable, replicable, reproducible and reusable. All analyses can be executed on any dataset with a similar structure.
import pandas as pd
import sys
import os
sys.path.append("../code")
import consistency
Refining the dataset¶
We start with exploring the content of the raw data.
df = pd.read_csv("../data/census2011.csv")
df
| Person ID | Region | Residence Type | Family Composition | Population Base | Sex | Age | Marital Status | Student | Country of Birth | Health | Ethnic Group | Religion | Economic Activity | Occupation | Industry | Hours worked per week | Approximated Social Grade | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7394816 | E12000001 | H | 2 | 1 | 2 | 6 | 2 | 2 | 1 | 2 | 1 | 2 | 5 | 8 | 2 | -9 | 4 |
| 1 | 7394745 | E12000001 | H | 5 | 1 | 1 | 4 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 8 | 6 | 4 | 3 |
| 2 | 7395066 | E12000001 | H | 3 | 1 | 2 | 4 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 6 | 11 | 3 | 4 |
| 3 | 7395329 | E12000001 | H | 3 | 1 | 2 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 7 | 7 | 3 | 2 |
| 4 | 7394712 | E12000001 | H | 3 | 1 | 1 | 5 | 4 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 4 | 3 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 569736 | 7946020 | W92000004 | H | 1 | 1 | 1 | 5 | 1 | 2 | 1 | 4 | 1 | 9 | 1 | 8 | 8 | 3 | 3 |
| 569737 | 7944310 | W92000004 | H | 3 | 1 | 1 | 3 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 7 | 4 | 3 | 4 |
| 569738 | 7945374 | W92000004 | H | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | -9 | -9 | -9 | -9 | -9 |
| 569739 | 7944768 | W92000004 | H | 1 | 1 | 2 | 8 | 5 | 2 | 1 | 3 | 1 | 9 | 5 | 9 | 2 | -9 | 4 |
| 569740 | 7944959 | W92000004 | H | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 7 | 4 | 1 | 4 |
569741 rows × 18 columns
We first refine the data to handle inconsistencies before further analysis.
We consider the following to be inconsistencies:
- 0-15 age range and any form of marital status other than single
- Any mismatched 'no codes' to do with student status
- Anyone marked as working and in very bad health
- Anyone with very bad health who is not marked as sick or disabled
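The first rule above can be sketched as a plain pandas filter. This is an illustrative re-implementation, not the real consistency module; the codes follow the notebook's enums, where Age 1 means "0-15" and Marital Status 1 means single:

```python
import pandas as pd

# Toy frame mirroring the census columns; codes follow the notebook's
# enums (Age 1 = "0-15", Marital Status 1 = "Single").
toy = pd.DataFrame({
    "Age":            [1, 1, 4],
    "Marital Status": [1, 2, 2],  # row 1 is an under-16 marked as married
})

# Rule 1: an under-16 with any marital status other than single
# contradicts the data dictionary.
contradictions = toy[(toy["Age"] == 1) & (toy["Marital Status"] != 1)]
print(contradictions.index.tolist())  # → [1]
```

The other rules follow the same shape: a condition on one column implying a required code in another.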
consistency.cleanDataFrame(df)
Checking for problem values...
Value checking finished.
Checking types...
Discrepancy of type in column Residence Type expected string found object
Type checking finished.
Retyping columns ['Residence Type'] ...
Retyping Residence Type from <class 'str'> to string
Retyping finished
>> Checking Age == 0-15
Requirement: [<MaritalStatusOptions: 1 -> SINGLE>] - CONTRADICTION
Age Age DESC Marital Status \
26774 1 0-15 2
26821 1 0-15 2
207835 1 0-15 2
452434 1 0-15 2
467282 1 0-15 2
480533 1 0-15 2
499946 1 0-15 2
511216 1 0-15 2
546848 1 0-15 2
554079 1 0-15 4
555682 1 0-15 5
Marital Status DESC
26774 Married or in a registered same-sex civil part...
26821 Married or in a registered same-sex civil part...
207835 Married or in a registered same-sex civil part...
452434 Married or in a registered same-sex civil part...
467282 Married or in a registered same-sex civil part...
480533 Married or in a registered same-sex civil part...
499946 Married or in a registered same-sex civil part...
511216 Married or in a registered same-sex civil part...
546848 Married or in a registered same-sex civil part...
554079 Divorced or formely in a same-sex civil partne...
555682 Widowed or surviving partner from a same-sexci...
Requirement: [<SocialGradeOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<HoursWorkedPerWeekOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<IndustryOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<OccupationOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<EconomicActivityOptions: -9 -> NO_CODE>] - HOLDS
>> Checking Population Base == Student living away from home during term-time
Requirement: [<FamilyCompositionOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<CountryOfBirthOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<HealthOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<EthnicityOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<ReligionOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<EconomicActivityOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<OccupationOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<IndustryOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<HoursWorkedPerWeekOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<SocialGradeOptions: -9 -> NO_CODE>] - HOLDS
>> Checking Health == Very bad health
Requirement: [<EconomicActivityOptions: 8 -> SICK_OR_DISABLED>, <EconomicActivityOptions: 5 -> RETIRED>, <EconomicActivityOptions: 6 -> STUDENT_INACTIVE>] - CONTRADICTION
Health Health DESC Economic Activity \
1025 5 Very bad health 2
1335 5 Very bad health 9
1958 5 Very bad health 9
2054 5 Very bad health 2
2477 5 Very bad health 7
... ... ... ...
568346 5 Very bad health 9
568805 5 Very bad health 1
569045 5 Very bad health 1
569248 5 Very bad health 9
569637 5 Very bad health 9
Economic Activity DESC
1025 Economically active: Self-employed
1335 Economically inactive: Other
1958 Economically inactive: Other
2054 Economically active: Self-employed
2477 Economically inactive: Looking after home or f...
... ...
568346 Economically inactive: Other
568805 Economically active: Employee
569045 Economically active: Employee
569248 Economically inactive: Other
569637 Economically inactive: Other
[1186 rows x 4 columns]
>> Checking Student == No
Requirement: [<CountryOfBirthOptions: 1 -> UK>, <CountryOfBirthOptions: 2 -> NON_UK>] - HOLDS
>> Checking Student == Yes
Requirement: [<EconomicActivityOptions: 4 -> FULL_TIME_STUDENT>, <EconomicActivityOptions: 6 -> STUDENT_INACTIVE>, <EconomicActivityOptions: -9 -> NO_CODE>] - HOLDS
>> Checking Economic Activity == Economically inactive: Retired
Requirement: [<HoursWorkedPerWeekOptions: -9 -> NO_CODE>] - HOLDS
>> Checking Economic Activity == Economically active: Unemployed
Requirement: [<HoursWorkedPerWeekOptions: -9 -> NO_CODE>] - HOLDS
>> Checking Residence Type == Resident in a communal establishment
Requirement: [<FamilyCompositionOptions: -9 -> NO_CODE>] - HOLDS
Requirement: [<SocialGradeOptions: -9 -> NO_CODE>] - HOLDS
| Person ID | Region | Residence Type | Family Composition | Population Base | Sex | Age | Marital Status | Student | Country of Birth | Health | Ethnic Group | Religion | Economic Activity | Occupation | Industry | Hours worked per week | Approximated Social Grade | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7394816 | E12000001 | H | 2 | 1 | 2 | 6 | 2 | 2 | 1 | 2 | 1 | 2 | 5 | 8 | 2 | -9 | 4 |
| 1 | 7394745 | E12000001 | H | 5 | 1 | 1 | 4 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 8 | 6 | 4 | 3 |
| 2 | 7395066 | E12000001 | H | 3 | 1 | 2 | 4 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 6 | 11 | 3 | 4 |
| 3 | 7395329 | E12000001 | H | 3 | 1 | 2 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 7 | 7 | 3 | 2 |
| 4 | 7394712 | E12000001 | H | 3 | 1 | 1 | 5 | 4 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 4 | 3 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 569736 | 7946020 | W92000004 | H | 1 | 1 | 1 | 5 | 1 | 2 | 1 | 4 | 1 | 9 | 1 | 8 | 8 | 3 | 3 |
| 569737 | 7944310 | W92000004 | H | 3 | 1 | 1 | 3 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 7 | 4 | 3 | 4 |
| 569738 | 7945374 | W92000004 | H | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | -9 | -9 | -9 | -9 | -9 |
| 569739 | 7944768 | W92000004 | H | 1 | 1 | 2 | 8 | 5 | 2 | 1 | 3 | 1 | 9 | 5 | 9 | 2 | -9 | 4 |
| 569740 | 7944959 | W92000004 | H | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 7 | 4 | 1 | 4 |
569741 rows × 18 columns
df["Residence Type"]
0 H
1 H
2 H
3 H
4 H
..
569736 H
569737 H
569738 H
569739 H
569740 H
Name: Residence Type, Length: 569741, dtype: string
Then, save the cleaned data to a separate file so we can reuse it later.
cleanPath = "../data/census2011-clean.csv"
df.to_csv(cleanPath) # save to csv
df = pd.read_csv(cleanPath) # set dataframe to cleaned data
To recreate this step, navigate to the parent directory, then execute the ./run_consistency script, which takes a CSV path as a parameter.
Design: Refining the data - Students A & B¶
Initially, we simply enumerated the possible values of each column and tested each column against them. This simplistic approach allowed us to rapidly evaluate the quality of the data, and a quick first analysis let us make an informed decision about how to handle invalid data. Since there were no invalid values, we judged that future datasets would be unlikely to contain large amounts of invalid data, and so chose to remove any invalid rows from the dataset. If there were a large number of invalid rows, this could cause issues: the sample used for analysis might not be fully representative of the original data, which could lead us to draw invalid conclusions.
We wanted to make the cleaning and verification of data extensible to other datasets,
but our original approach would have needed a complete rewrite for a new dataset
with new columns. We therefore developed OptionEnum, which extends Enum
and stores a mapping from each key to its description. We can now easily work with the
dataset, listing all possible values with their descriptions as well as parsing them.
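A minimal sketch of the pattern (the real OptionEnum lives in our code directory and its interface may differ slightly; ResidenceOptions here is cut down to its two codes):

```python
from enum import Enum

class OptionEnum(Enum):
    """Each member carries a key plus a human-readable description."""

    def key(self):
        return self.value[0]

    def desc(self):
        return self.value[1]

    @classmethod
    def mapping(cls):
        # key -> description, handy for Series.replace
        return {m.key(): m.desc() for m in cls}

class ResidenceOptions(OptionEnum):
    COMMUNAL = ("C", "Resident in a communal establishment")
    HOUSEHOLD = ("H", "Not resident in a communal establishment")

print([f"{x.key()}: {x.desc()}" for x in ResidenceOptions])
```

Defining a new dataset then only requires declaring its option enums, rather than rewriting the cleaning code.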
import census_microdata_2011 as md
[f"{x.key()}: {x.desc()}" for x in md.EthnicityOptions]
['1: White', '2: Mixed', '3: Asian or Asian British', '4: Black or Black British', '5: Chinese or Other ethnic group', '-9: No code required (Not resident in england or wales, students or schoolchildren living away during term-time)']
We can also use this to easily search for a particular value in the dataset:
df.loc[df["Age"] == md.AgeOptions.FROM_35_TO_44.key()]
| Unnamed: 0 | Person ID | Region | Residence Type | Family Composition | Population Base | Sex | Age | Marital Status | Student | Country of Birth | Health | Ethnic Group | Religion | Economic Activity | Occupation | Industry | Hours worked per week | Approximated Social Grade | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 7394745 | E12000001 | H | 5 | 1 | 1 | 4 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 8 | 6 | 4 | 3 |
| 2 | 2 | 7395066 | E12000001 | H | 3 | 1 | 2 | 4 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 6 | 11 | 3 | 4 |
| 6 | 6 | 7394871 | E12000001 | H | 5 | 1 | 2 | 4 | 3 | 2 | 1 | 2 | 1 | 1 | 1 | 6 | 11 | 2 | 3 |
| 18 | 18 | 7395059 | E12000001 | H | 1 | 1 | 1 | 4 | 1 | 2 | 1 | 3 | 1 | 1 | 1 | 8 | 2 | 3 | 4 |
| 22 | 22 | 7394857 | E12000001 | H | 2 | 1 | 1 | 4 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 8 | 2 | 3 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 569685 | 569685 | 7944687 | W92000004 | H | 2 | 1 | 1 | 4 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 8 | 4 | 3 | 4 |
| 569693 | 569693 | 7945171 | W92000004 | H | 2 | 1 | 2 | 4 | 2 | 2 | 1 | 1 | 1 | 2 | 1 | 4 | 4 | 2 | 2 |
| 569706 | 569706 | 7946284 | W92000004 | H | 1 | 1 | 2 | 4 | 1 | 2 | 1 | 1 | 1 | 3 | 1 | 3 | 11 | 3 | 2 |
| 569725 | 569725 | 7945073 | W92000004 | H | 1 | 1 | 2 | 4 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 4 | 11 | 3 | 2 |
| 569733 | 569733 | 7944827 | W92000004 | H | 5 | 1 | 2 | 4 | 4 | 2 | 1 | 1 | 1 | 1 | 1 | 6 | 10 | 2 | 4 |
78641 rows × 19 columns
And easily translate the cryptic keys into their descriptive strings:
df["Residence Type"].replace(md.ResidenceOptions.mappings)
0 Not resident in a communal establishment
1 Not resident in a communal establishment
2 Not resident in a communal establishment
3 Not resident in a communal establishment
4 Not resident in a communal establishment
...
569736 Not resident in a communal establishment
569737 Not resident in a communal establishment
569738 Not resident in a communal establishment
569739 Not resident in a communal establishment
569740 Not resident in a communal establishment
Name: Residence Type, Length: 569741, dtype: object
Descriptive analysis of cleaned data - Student B¶
For this basic requirement we were asked to obtain:
- the total number of records in the dataset
- the type of each variable in the dataset
- all different values that each variable takes and the number of occurrences of each value (excluding Person ID)
To encapsulate this entire requirement, we implemented printSummary
import stats as s
s.printSummary(df)
Number of Records: 569741

Column types
-----------
Person ID                    int64
Region                       object
Residence Type               object
Family Composition           int64
Population Base              int64
Sex                          int64
Age                          int64
Marital Status               int64
Student                      int64
Country of Birth             int64
Health                       int64
Ethnic Group                 int64
Religion                     int64
Economic Activity            int64
Occupation                   int64
Industry                     int64
Hours worked per week        int64
Approximated Social Grade    int64
dtype: object

Region    E12000008 E12000007 E12000002 E12000006 E12000005 E12000009 E12000003 E12000004 W92000004 E12000001
count     88084 83582 71436 59411 56875 53774 53471 45782 30977 26349

Residence Type    H C
count             559087 10654

Family Composition    2 1 3 5 -9 4 6
count                 300962 96690 72641 64519 18851 9848 6230

Population Base    1 2 3
count              561040 6730 1971

Sex      2 1
count    289172 280569

Age      1 4 5 3 2 6 7 8
count    106832 78641 77388 75948 72785 65666 48777 43704

Marital Status    1 2 4 5 3
count             270999 214180 40713 31898 11951

Student    2 1
count      443204 126537

Country of Birth    1 2 -9
count               485645 77292 6804

Health    1 2 3 4 5 -9
count     264971 191744 74480 24558 7184 6804

Ethnic Group    1 3 4 2 -9 5
count           483477 42712 18786 12209 6804 5753

Religion    2 1 9 6 4 -9 7 5 3 8
count       333481 141658 40613 27240 8214 6804 4215 2572 2538 2406

Economic Activity    1 -9 5 2 6 3 8 7 4 9
count                216025 112618 97480 40632 24756 18109 17991 17945 14117 10068

Occupation    -9 2 9 4 5 3 1 7 6 8
count         149984 64111 58483 53254 48546 44937 39788 38523 37297 34818

Industry    -9 4 2 8 11 10 6 3 5 9 12 7 1
count       149984 68878 53433 49960 49345 40560 35240 30708 25736 24908 20256 16776 3957

Hours worked per week    -9 3 2 4 1
count                    302321 153938 52133 35573 25776

Approximated Social Grade    2 -9 4 1 3
count                        159642 124103 123740 82320 79936
To see the counts of an individual column, use getUniqueCounts
s.getUniqueCounts(df["Country of Birth"])
| Country of Birth | count | |
|---|---|---|
| 0 | 1 | 485645 |
| 1 | 2 | 77292 |
| 2 | -9 | 6804 |
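The helper can be sketched with pandas' own value_counts (an illustrative re-implementation; the real getUniqueCounts lives in stats.py):

```python
import pandas as pd

def get_unique_counts(series):
    """Sketch of the counting helper: each distinct value with its
    number of occurrences, most frequent first."""
    return (series.value_counts()
                  .rename_axis(series.name)
                  .reset_index(name="count"))

# Toy column using the Country of Birth codes (1 = UK, 2 = non-UK, -9 = no code)
births = pd.Series([1, 2, 1, -9, 1], name="Country of Birth")
print(get_unique_counts(births))
```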
To recreate this step, navigate to the parent directory, then execute the ./run_summary script, which takes a CSV path as a parameter.
For the second part of the descriptive analysis, we were asked to build the following plots:
- a bar chart of the number of records for each region
- a bar chart of the number of records for each occupation
- a pie chart of the distribution of the sample by age
- a pie chart of the distribution of the sample by economic activity
%matplotlib inline
import basic_plots as b
out1 = b.Output()
out2 = b.Output()
out3 = b.Output()
out4 = b.Output()
b.interact(b.genRecordBarPlot, df=b.fixed(df), colName=b.Dropdown(options=df.columns, value='Region'), save=b.fixed(False), _output=out1)
b.interact(b.genRecordBarPlot, df=b.fixed(df), colName=b.Dropdown(options=df.columns, value='Occupation'), save=b.fixed(False), _output=out2)
b.interact(b.genDistPieChart, df=b.fixed(df), colName=b.Dropdown(options=df.columns, value='Age'), save=b.fixed(False), _output=out3)
b.interact(b.genDistPieChart, df=b.fixed(df), colName=b.Dropdown(options=df.columns, value='Economic Activity'), save=b.fixed(False), _output=out4)
display(out1)
display(out2)
display(out3)
display(out4)
interactive(children=(Dropdown(description='colName', index=2, options=('Unnamed: 0', 'Person ID', 'Region', '…
interactive(children=(Dropdown(description='colName', index=15, options=('Unnamed: 0', 'Person ID', 'Region', …
interactive(children=(Dropdown(description='colName', index=7, options=('Unnamed: 0', 'Person ID', 'Region', '…
interactive(children=(Dropdown(description='colName', index=14, options=('Unnamed: 0', 'Person ID', 'Region', …
Output()
Output()
Output()
Output()
To recreate this step, navigate to the parent directory, then execute the ./run_plots script, which takes a CSV path as a parameter and assumes the existence of "Region", "Occupation", "Age" and "Economic Activity" columns. When the script is run, the plots are saved as PNG images in the images directory.
Using groupby to produce tables - Student B¶
We were asked to produce the following tables:
- number of records by region and industry
- number of records by occupation and social grade
To make this functionality easy to reuse, we wrote a single function, getGroupTable, which takes a dataframe and two column names and produces a table showing the number of records for each pair of values in those columns. It works with any pair of columns in the given dataframe.
s.getGroupTable(df, "Region", "Industry")
| Region | Industry | counts | |
|---|---|---|---|
| 0 | E12000001 | -9 | 6854 |
| 1 | E12000001 | 4 | 3087 |
| 2 | E12000001 | 2 | 2851 |
| 3 | E12000001 | 11 | 2524 |
| 4 | E12000001 | 8 | 1883 |
| ... | ... | ... | ... |
| 125 | W92000004 | 5 | 1641 |
| 126 | W92000004 | 6 | 1500 |
| 127 | W92000004 | 12 | 992 |
| 128 | W92000004 | 7 | 594 |
| 129 | W92000004 | 1 | 403 |
130 rows × 3 columns
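The core of such a function is a groupby followed by size (an illustrative sketch under assumed behaviour, not the exact implementation in stats.py):

```python
import pandas as pd

def get_group_table(df, col_a, col_b):
    """Sketch of getGroupTable: record counts per (col_a, col_b) pair,
    highest counts first within each col_a group."""
    return (df.groupby([col_a, col_b]).size()
              .reset_index(name="counts")
              .sort_values([col_a, "counts"], ascending=[True, False])
              .reset_index(drop=True))

toy = pd.DataFrame({"Region": ["E12000001", "E12000001", "W92000004"],
                    "Industry": [4, 4, 2]})
print(get_group_table(toy, "Region", "Industry"))
```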
s.getGroupTable(df, "Occupation", "Approximated Social Grade")
| Occupation | Approximated Social Grade | counts | |
|---|---|---|---|
| 0 | -9 | -9 | 116915 |
| 1 | -9 | 2 | 17787 |
| 2 | -9 | 4 | 12169 |
| 3 | -9 | 3 | 2062 |
| 4 | -9 | 1 | 1051 |
| 5 | 1 | 1 | 19190 |
| 6 | 1 | 2 | 18555 |
| 7 | 1 | 4 | 967 |
| 8 | 1 | 3 | 584 |
| 9 | 1 | -9 | 492 |
| 10 | 2 | 1 | 48104 |
| 11 | 2 | 2 | 13223 |
| 12 | 2 | 4 | 1009 |
| 13 | 2 | 3 | 891 |
| 14 | 2 | -9 | 884 |
| 15 | 3 | 2 | 35435 |
| 16 | 3 | 1 | 7050 |
| 17 | 3 | 4 | 986 |
| 18 | 3 | -9 | 819 |
| 19 | 3 | 3 | 647 |
| 20 | 4 | 2 | 44922 |
| 21 | 4 | 1 | 3000 |
| 22 | 4 | 3 | 2353 |
| 23 | 4 | 4 | 2252 |
| 24 | 4 | -9 | 727 |
| 25 | 5 | 3 | 37190 |
| 26 | 5 | 4 | 7629 |
| 27 | 5 | 2 | 2464 |
| 28 | 5 | -9 | 678 |
| 29 | 5 | 1 | 585 |
| 30 | 6 | 3 | 15555 |
| 31 | 6 | 4 | 13860 |
| 32 | 6 | 2 | 6343 |
| 33 | 6 | 1 | 1061 |
| 34 | 6 | -9 | 478 |
| 35 | 7 | 4 | 21347 |
| 36 | 7 | 2 | 12184 |
| 37 | 7 | 3 | 2997 |
| 38 | 7 | -9 | 1031 |
| 39 | 7 | 1 | 964 |
| 40 | 8 | 4 | 21088 |
| 41 | 8 | 3 | 11157 |
| 42 | 8 | 2 | 1719 |
| 43 | 8 | -9 | 441 |
| 44 | 8 | 1 | 413 |
| 45 | 9 | 4 | 42433 |
| 46 | 9 | 2 | 7010 |
| 47 | 9 | 3 | 6500 |
| 48 | 9 | -9 | 1638 |
| 49 | 9 | 1 | 902 |
An example of reuse of this method:
s.getGroupTable(df, "Student", "Religion")
| Student | Religion | counts | |
|---|---|---|---|
| 0 | 1 | 2 | 60401 |
| 1 | 1 | 1 | 35488 |
| 2 | 1 | 6 | 10398 |
| 3 | 1 | 9 | 8660 |
| 4 | 1 | -9 | 6804 |
| 5 | 1 | 4 | 2113 |
| 6 | 1 | 7 | 1091 |
| 7 | 1 | 5 | 624 |
| 8 | 1 | 3 | 607 |
| 9 | 1 | 8 | 351 |
| 10 | 2 | 2 | 273080 |
| 11 | 2 | 1 | 106170 |
| 12 | 2 | 9 | 31953 |
| 13 | 2 | 6 | 16842 |
| 14 | 2 | 4 | 6101 |
| 15 | 2 | 7 | 3124 |
| 16 | 2 | 8 | 2055 |
| 17 | 2 | 5 | 1948 |
| 18 | 2 | 3 | 1931 |
Queries with pandas - Student C¶
We were asked to perform queries on the dataframe to find:
- the number of economically active people by region
- the number of economically active people by age
- any discrepancies between student status and economic activity
- the number of working hours per week for students
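The first query can be sketched with a membership filter plus a groupby. The coding here is an assumption taken from the notebook's enums: Economic Activity values 1-4 are the "economically active" categories (employee, self-employed, unemployed, full-time student), while 5-9 and -9 are inactive or uncoded:

```python
import pandas as pd

# Assumed coding (see note above): 1-4 = economically active.
ACTIVE = {1, 2, 3, 4}

toy = pd.DataFrame({
    "Person ID": [10, 11, 12],
    "Region": ["E12000001", "E12000001", "W92000004"],
    "Economic Activity": [1, 5, 2],  # middle row is retired, so inactive
})

# Keep only active people, then count them per region.
active_by_region = (toy[toy["Economic Activity"].isin(ACTIVE)]
                    .groupby("Region")["Person ID"].count())
print(active_by_region)
```

The by-age query is the same pattern with "Age" as the grouping column.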
%matplotlib widget
import queries as q
import threed_plots as p
q.find_query1(df)
p.plotScatter(s.getGroupTable(df, "Region", "Economic Activity"), "Region", "Economic Activity")
Number of economically active people by region:
Region
E12000001    21371
E12000002    57513
E12000003    43073
E12000004    36861
E12000005    45258
E12000006    47674
E12000007    66212
E12000008    70306
E12000009    43807
W92000004    25048
Name: Person ID, dtype: int64
interactive(children=(Dropdown(description='Region', options=('E12000001', 'E12000002', 'E12000003', 'E1200000…
q.find_query2(df)
p.plotScatter(s.getGroupTable(df, "Region", "Age"), "Region", "Age")
Number of economically active people by age:
Region
E12000001    21371
E12000002    57513
E12000003    43073
E12000004    36861
E12000005    45258
E12000006    47674
E12000007    66212
E12000008    70306
E12000009    43807
W92000004    25048
Name: Person ID, dtype: int64
interactive(children=(Dropdown(description='Region', options=('E12000001', 'E12000002', 'E12000003', 'E1200000…
q.find_discrepancies(df)
Discrepancies found between student status and economic activity:
Economic Activity
-9    88582
 6    23838
 4    14117
Name: count, dtype: int64
q.find_hours(df)
Working hours found per week for students: -227422
3D plots - Student B w/ ipywidgets - Student C¶
The following 3D plots are made using the previously described getGroupTable method. There are two methods to generate them: plotScatter, which produces a 3D scatter plot, and plotSurface, which produces a 3D surface plot.
p.plotScatter(s.getGroupTable(df, "Region", "Industry"), "Region", "Industry")
p.plotSurface(s.getGroupTable(df, "Region", "Industry"), "Region", "Industry")
p.plotScatter(s.getGroupTable(df, "Occupation", "Approximated Social Grade"), "Occupation", "Approximated Social Grade")
p.plotSurface(s.getGroupTable(df, "Occupation", "Approximated Social Grade"), "Occupation", "Approximated Social Grade")
interactive(children=(Dropdown(description='Region', options=('E12000001', 'E12000002', 'E12000003', 'E1200000…
interactive(children=(Dropdown(description='Region', options=('E12000001', 'E12000002', 'E12000003', 'E1200000…
interactive(children=(Dropdown(description='Occupation', options=(1, 2, 3, 4, 5, 6, 7, 8, 9, -9), value=1), Dr…
interactive(children=(Dropdown(description='Occupation', options=(1, 2, 3, 4, 5, 6, 7, 8, 9, -9), value=1), Dr…
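The surface variant boils down to matplotlib's 3D axes over the pair-count grid. A minimal headless sketch (illustrative shape only, not the real plotSurface, which also wires up ipywidgets):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Counts for a toy 2x3 pair-count table (rows: regions, cols: industries).
counts = np.array([[5, 2, 7],
                   [1, 9, 3]])
x, y = np.meshgrid(np.arange(counts.shape[1]), np.arange(counts.shape[0]))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")   # 3D axes, matplotlib >= 3.2
ax.plot_surface(x, y, counts, cmap="viridis")
ax.set_xlabel("Industry")
ax.set_ylabel("Region")
ax.set_zlabel("counts")
fig.savefig("surface_sketch.png")
```

The scatter variant uses ax.scatter(x, y, counts) on the same axes instead of plot_surface.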
Mapping by region - Student B¶
To map the data for each region, we used the folium library to generate the map elements, along with the Find that Postcode API to obtain the region borders. This functionality is encapsulated in the method plotMap, which takes a dataframe and the column whose data is to be visualised.
On the map itself, the data is visualised in multiple ways. Firstly, there is a colour scale representing the average value for a given region, which is particularly useful when the keys are ordered (e.g. Age). Secondly, hovering over a particular region shows a table of the number of records for each key in that region. To translate this information, there is a legend containing the keys and their associated values; for convenience, the legend can be dragged around the map.
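The colour-scale input is just a per-region mean of the chosen column. A pure-pandas sketch of that aggregation step (the folium/GeoJSON rendering itself is omitted):

```python
import pandas as pd

# Toy records; the mean of the chosen column per region drives the
# choropleth colour scale.
toy = pd.DataFrame({
    "Region": ["E12000001", "E12000001", "W92000004"],
    "Age": [2, 4, 6],
})
region_means = toy.groupby("Region")["Age"].mean()
print(region_means)
```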
import map_plot as m
plot = m.plotMap(df, "Age")
plot
Generating region by Age map... Done.
plot = m.plotMap(df, "Marital Status")
plot
Generating region by Marital Status map... Done.
Performance Analysis and Optimisation - Student A¶
One of the hard requirements was to analyse the performance of different steps of our analysis. To do this, I used the timeit library and created performance tests over a range of dataset sizes.
I identified two problematic steps: validation and parsing.
Benchmarking with timeit¶
To compare the optimised and unoptimised steps of the data analysis, I used Python's built-in timeit library along with the census data already provided. I ran the steps on different numbers of rows from the dataset, ranging from 10 rows to 400000. This allowed me not only to see whether the algorithm had sped up, but also to predict how it would behave on the even larger datasets our code could be used for in future.
To increase the reliability of the results, I ran each step multiple times: the unoptimised variants 3 times and the optimised variants 10 times, due to the large disparity in the time taken to run each.
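The timing harness follows the standard timeit pattern. A self-contained sketch with toy stand-ins for the real steps (slow_sum and fast_sum are illustrative, not project functions):

```python
import timeit

def slow_sum(n=10_000):
    # element-by-element Python loop (stand-in for the unoptimised step)
    total = 0
    for i in range(n):
        total += i
    return total

def fast_sum(n=10_000):
    # built-in, C-backed equivalent (stand-in for the optimised step)
    return sum(range(n))

# Average over several runs; fewer repeats for the slow variant, as in
# our benchmarks, because each run costs more.
t_slow = timeit.timeit(slow_sum, number=3) / 3
t_fast = timeit.timeit(fast_sum, number=10) / 10
print(f"slow: {t_slow:.2e}s per run, fast: {t_fast:.2e}s per run")
```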
Validation¶
The validation step checks the entire data frame for invalid data and reports any rows that are invalid. It was observed to be slow from the moment the validation code was first written.
Our original implementation used the naive approach of iterating through the data frame and checking that each encoded value was one of the permitted values.
Experimenting with pandas, I found the method Series.isin(values), which
produces a new Boolean series indicating whether each value is in
the given set of values. I changed the method to use this, which also allowed
us to easily see which row numbers contained the problematic values.
After this change I immediately saw a huge performance improvement, which I later benchmarked (see below).
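The isin-based check in miniature (toy codes, not the real validation module):

```python
import pandas as pd

# Codes 1-3 and -9 are permitted for this toy column; 7 is not.
codes = pd.Series([1, 2, 7, -9, 3])
allowed = {1, 2, 3, -9}

valid = codes.isin(allowed)            # vectorised membership test
bad_rows = codes.index[~valid].tolist()
print(bad_rows)  # → [2]
```

The Boolean series doubles as a report: inverting it and reading off the index gives exactly the problematic row numbers.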
Parsing¶
The parsing step converts the encoded data into the long,
human-readable descriptions. Having learned from the previous step that
pandas' vectorised methods are much faster than iteration, I used Series.apply(func) to apply
a mapping function that parsed each value. However, this was still not very
fast; although it was unlikely ever to be run on the whole data frame,
it remained slow even on subsets.
I tried multiple approaches, such as using a dictionary inside the
parsing function, but these only improved performance marginally.
After much experimentation, I wondered whether .apply() was
the best way to perform the transformation at all. I discovered that there
was indeed a method that eliminates it: .replace(), which takes a dictionary
mapping each key to a value. Using this yielded
a massive performance improvement.
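The two approaches side by side on a toy column (Residence Type codes from the dataset):

```python
import pandas as pd

residence = pd.Series(["H", "C", "H"])
mapping = {"H": "Not resident in a communal establishment",
           "C": "Resident in a communal establishment"}

via_apply = residence.apply(lambda k: mapping[k])   # Python call per element
via_replace = residence.replace(mapping)            # single vectorised pass

# Same result, very different speed on 569741 rows.
print(via_replace.equals(via_apply))  # → True
```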
Running the benchmarks¶
To run the benchmarks, the script ./run_performance can be used, which
will generate graphs in the images/performance directory.
The benchmark can also be run below - normally it would be a bad idea to run benchmarks in a Jupyter notebook, but in this case the performance difference is so extreme that it should overshadow any noise.
#%matplotlib inline
import performance
performance.profile_and_plot(df)
10 Rows: Simple Iteration...
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Cell In[22], line 4 1 #%matplotlib inline 3 import performance ----> 4 performance.profile_and_plot(df) File ~/Desktop/CS Work/CS2006/Python2/notebooks/../code/performance.py:85, in profile_and_plot(df, save) 83 print("Data set is not atleast 400000, this will mean graphs are incorrect") 84 cur = {} ---> 85 profileSize(df, 10, cur) 86 profileSize(df, 100, cur) 87 profileSize(df, 500, cur) File ~/Desktop/CS Work/CS2006/Python2/notebooks/../code/performance.py:59, in profileSize(df, size, cur) 57 def profileSize(df, size, cur): 58 print(f"{size} Rows: ", flush=True, end="") ---> 59 result = profileWith(df.head(size)) 60 for k,v in result.items(): 61 pair = (size, v) File ~/Desktop/CS Work/CS2006/Python2/notebooks/../code/performance.py:44, in profileWith(df) 42 results = {} 43 print("Simple Iteration... ", flush=True, end="") ---> 44 results["iterate_df_in_values"] = profileMethod(lambda: iterate_df_in_values(df), 3) 45 print("Current implementation... 
", flush=True, end="") 46 results["current_impl"] = profileMethod(lambda: cur_impl(df), 10) File ~/Desktop/CS Work/CS2006/Python2/notebooks/../code/performance.py:55, in profileMethod(f, number) 54 def profileMethod(f, number): ---> 55 return timeit.timeit(f, number=number) / number File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/timeit.py:237, in timeit(stmt, setup, timer, number, globals) 234 def timeit(stmt="pass", setup="pass", timer=default_timer, 235 number=default_number, globals=None): 236 """Convenience function to create Timer object and call timeit method.""" --> 237 return Timer(stmt, setup, timer, globals).timeit(number) File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/timeit.py:180, in Timer.timeit(self, number) 178 gc.disable() 179 try: --> 180 timing = self.inner(it, self.timer) 181 finally: 182 if gcold: File <timeit-src>:6, in inner(_it, _timer, _stmt) File ~/Desktop/CS Work/CS2006/Python2/notebooks/../code/performance.py:44, in profileWith.<locals>.<lambda>() 42 results = {} 43 print("Simple Iteration... ", flush=True, end="") ---> 44 results["iterate_df_in_values"] = profileMethod(lambda: iterate_df_in_values(df), 3) 45 print("Current implementation... ", flush=True, end="") 46 results["current_impl"] = profileMethod(lambda: cur_impl(df), 10) File ~/Desktop/CS Work/CS2006/Python2/notebooks/../code/performance.py:15, in iterate_df_in_values(df) 13 invalid_columns[column] = invalid_rows 14 col = md.colMap.get(column) ---> 15 if col.options is None: 16 continue 17 for i, cell in enumerate(df[column]): AttributeError: 'NoneType' object has no attribute 'options'
Benchmarking Results¶
Validation¶
We can clearly see that pandas' isin() method massively outperforms
iteration.
At 400000 rows, isin() is over 20x faster: 0.044s vs 1.045s.
From the graph this appears to be only a constant-factor improvement: both algorithms seem to have O(n) complexity, which means that with a large enough dataset the validation step could still take a long time. With our dataset, however, it goes from being slightly slow to near-instant, which is a much appreciated improvement.
Parsing¶
The results for the parsing step are similar to those for validation:
the .replace() method massively outperforms .apply().
At 400000 rows, .replace() is over 25x faster: 0.49s vs 13.7s.
Again, the complexity of both algorithms appears to be the same: O(n).
Lessons and Recommendations¶
The main recommendations from this experience are to:
- use pandas' built-in methods whenever possible
- prefer passing primitives (e.g. lists, sets, dictionaries) rather than functions
I believe the reason for this is that pandas is built on NumPy. NumPy is designed to perform operations on whole columns at once and is partly written in C. To get the best performance, we therefore need to pass arguments that can be translated into C easily, such as Python primitives. This allows pandas to use NumPy effectively without repeatedly crossing the Python/C boundary.
This article discusses the issue briefly: https://labs.quansight.org/blog/unlocking-c-level-performance-in-df-apply
Unit Testing¶
For this practical, we did not see many invalid values in the data, so we needed some other way to test that our input validation was working correctly. To do this, we created tests that check various permitted and disallowed values by calling the same functions as our verification code.
We worked on the tests and the encoding of the variables separately, which meant the chance of anything being missed by both was very low. It also gave us confidence that, when we changed the parsing code to make it more extensible and to optimise it, we would know if anything broke. In fact, we caught a couple of mistakes using these unit tests.
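The shape of such a test, sketched with an illustrative stand-in (valid_sex is hypothetical; the real tests call the project's own validation helpers):

```python
# valid_sex is a stand-in for the project's real validation helpers:
# Sex codes 1 and 2 are permitted, everything else is rejected.
def valid_sex(code):
    return code in {1, 2}

def test_accepts_permitted_codes():
    assert valid_sex(1) and valid_sex(2)

def test_rejects_unknown_codes():
    assert not valid_sex(0) and not valid_sex(-9)

test_accepts_permitted_codes()
test_rejects_unknown_codes()
print("all tests passed")
```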
Our unit tests can be run with the ./test.sh script.
import test as tests
tests.test()
Bibliography¶
https://focaalvarez.medium.com/mapping-the-uk-and-navigating-the-post-code-maze-4898e758b82f
https://medium.com/@patohara60/interactive-mapping-in-python-with-uk-census-data-6e571c60ff4
https://stackoverflow.com/questions/46775667/plotting-uk-districts-postcode-areas-and-regions
https://realpython.com/python-folium-web-maps-from-data/
https://stackoverflow.com/questions/9170838/surface-plots-in-matplotlib
https://stackoverflow.com/questions/32073498/check-if-file-is-readable-with-python-try-or-if-else
https://stackoverflow.com/questions/54113067/3d-scatterplot-with-strings-in-python